“Have you found it, Wiggins?”


[Sherlock Holmes in A Study in Scarlet] 


Abstract 

Detecting new information and events in a dynamic network by probing individual nodes has many
practical applications: discovering new webpages, analyzing influence properties in networks, and detecting
failure propagation in electronic circuits or infections in public drinking-water systems. In practice,
it is infeasible for anyone but the owner of the network (if one exists) to monitor all nodes at all times.
In this work we study the constrained setting when the observer can only probe a small set of nodes at 
each time step to check whether new pieces of information (items) have reached those nodes. 

We formally define the problem through an infinite time generating process that places new items 
in subsets of nodes according to an unknown probability distribution. Items have an exponentially 
decaying novelty, modeling their decreasing value. The observer uses a probing schedule (i.e., a probability 
distribution over the set of nodes) to choose, at each time step, a small set of nodes to check for new items. 
The goal is to compute a schedule that minimizes the average novelty of undetected items. We present 
an algorithm, WIGGINS, to compute the optimal schedule through convex optimization, and then show 
how it can be adapted when the parameters of the problem must be learned or change over time. We also 
present a scalable variant of WIGGINS for the MapReduce framework. The results of our experimental 
evaluation on real social networks demonstrate the practicality of our approach. 


1 Introduction 

Many applications require the detection of events in a network as soon as they happen or shortly thereafter, 
as the value of the information obtained by detecting the events decays rapidly as time passes. For example, 
an emerging trend in algorithmic stock trading is the use of automatic search through the Web and social 
networks for pieces of information that can be used in trading decisions before they appear in the more 
popular news sites. Similarly, intelligence, business, and political analysts scan online
sources for new information or rumors. While new items are often reblogged, retweeted, and posted on 
a number of sites, it is sufficient to find an item once, as fast as possible, before it loses its relevance or 
freshness. There is no benefit in seeing multiple copies of the same news item or rumor. This is also the case 
when monitoring for intrusions, infections, or defects in, respectively, a computer network, a public water 
system, or a large electronic circuit. 

Monitoring for new events or information is a fundamental search and detection problem in a distributed 
data setting, not limited to social networks or graph analysis. In this setting, the data is distributed among a 
large number of nodes, and new items appear in individual nodes (for example, as the products of processing 
the data available locally at the node), and may propagate (being copied) to neighboring nodes on a physical
or a virtual network. The goal is to detect at least one copy of each new item as soon as possible. The 
search application can access any node in the system, but it can only probe (i.e., check for new items on) a 
few nodes at a time. To minimize the time to find new items, the search application needs to optimize the 
schedule of probing nodes, taking into account (i) the distribution of copies of items among the nodes (to 
choose which nodes to probe), and (ii) the decay of the items’ novelty (or relevance/freshness) over time (to 
focus the search on most relevant items). The main challenge is how to devise a good probing schedule in 
the absence of prior knowledge about the generation and distribution of items in the network. 

Contributions. In this work we study the novel problem of computing an optimal node probing schedule
for detecting new items in a network under resource scarcity, i.e., when only a few nodes can be probed
at a time. Our contributions to the study of this problem are the following:

• We formalize a generic process that describes the creation and distribution of information in a network, 
and define the computational task of learning this process by probing the nodes in the network according 
to a schedule. The process and task are parametrized by the resource limitations of the observer and 
the decay rate of the novelty of items. We introduce a cost measure to compare different schedules: the 
cost of a schedule is the limit of the average expected novelty of uncaught items at each time step. On 
the basis of these concepts, we formally define the Optimal Probing Schedule Problem, which asks for
the schedule with minimum cost.

• We conduct a theoretical study of the cost of a schedule, showing that it can be computed explicitly and 
that it is a convex function over the space of schedules. We then introduce WIGGINS,^1 an algorithm to
compute the optimal schedule by solving a constrained convex optimization problem through the use 
of an iterative method based on Lagrange multipliers. 

• We discuss variants of WIGGINS for the realistic situation where the parameters of the process need to
be learned or can change over time. We show how to compute a schedule which is (probabilistically) 
guaranteed to have a cost very close to the optimal by only observing the generating process for a 
limited amount of time. We also present a MapReduce adaptation of WIGGINS to handle very large 
networks. 

• Finally, we conduct an extensive experimental evaluation of WIGGINS and its variants, comparing 
the performances of the schedules it computes with natural baselines, and showing how it performs 
extremely well in practice on real social networks when using well-established models for generating 
new items (e.g., the independent cascade model).

To the best of our knowledge, the problem we study is novel and we are the first to devise an algorithm 
to compute an optimal schedule, both when the generating process parameters are known and when they 
need to be learned. 

Paper Organization. In Sect. 2 we give introductory definitions, and formally introduce the settings and
the problem. We discuss related works in Sect. 3. In Sect. 4 we describe our algorithm WIGGINS and its
variants. The results of our experimental evaluation are presented in Sect. 5. We conclude by outlining
directions for future work.

2 Problem Definition 

In this section we formally introduce the problem and define our goal. 

Let $G = (V, E)$ be a graph with $|V| = n$ nodes. W.l.o.g. we let $V = [n]$. Let $\mathcal{F} \subseteq 2^V$ be a collection
of subsets of $V$, i.e., a collection of sets of nodes. Let $\pi$ be a function from $\mathcal{F}$ to $[0,1]$ (not necessarily a
probability distribution). We model the generation and diffusion of information in the network by defining
a generating process $\Gamma = (\mathcal{F}, \pi)$. $\Gamma$ is an infinite discrete-time process which, at each time step $t$, generates a
collection of sets $\mathcal{I}_t \subseteq \mathcal{F}$ such that each set $S \in \mathcal{F}$ is included in $\mathcal{I}_t$ with probability $\pi(S)$, independently of
$t$ and of other sets generated at time $t$ and at times $t' < t$. For any $t$ and any $S \in \mathcal{I}_t$, the ordered pair $(t, S)$
represents an item: a piece of information that was generated at time $t$ and instantaneously reached the nodes
in $S$. We choose to model the diffusion process as instantaneous because this abstraction accurately models
the view of an outside resource-limited observer that does not have the resources to monitor simultaneously
all the nodes in the network at the fine time granularity needed to observe the different stages of the diffusion
process.

^1 In the Sherlock Holmes novel A Study in Scarlet by A. Conan Doyle, Wiggins is the leader of the “Baker Street Irregulars”,
a band of street urchins employed by Holmes as intelligence agents.
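As an illustration of the generating process, the following minimal Python sketch draws the items of each time step by including every candidate set independently with its probability. The function name `generate_items`, the dictionary representation of $\pi$, and the toy values are assumptions made for this example only; they do not come from the paper.

```python
import random

def generate_items(t, pi, rng):
    """Return the items (t, S) generated at time step t.

    pi maps each candidate set S (a frozenset of nodes) to pi(S);
    each set is included independently of all the others.
    """
    return [(t, S) for S, prob in pi.items() if rng.random() < prob]

# Hypothetical toy process on V = {0, 1, 2}.
pi = {frozenset({0}): 0.5, frozenset({0, 1}): 0.3, frozenset({2}): 0.1}
rng = random.Random(0)
history = [generate_items(t, pi, rng) for t in range(5)]
for step in history:
    print(step)
```

Because $\pi$ need not be a probability distribution, several sets (or none) may be generated at the same step.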

Probing and schedule. The observer can only monitor the network by probing nodes. Formally, by probing
a node $v \in V$ at time $t$, we mean obtaining the set $I(t, v)$ of items $(t', S)$ such that $t' \le t$ and $v \in S$:^2

\[ I(t, v) := \{ (t', S) : t' \le t,\ S \in \mathcal{I}_{t'},\ v \in S \} . \]

Let $U_t$ be the set of items $(t', S)$, $S \in \mathcal{I}_{t'}$, generated by $\Gamma$ at any time $t' \le t$, and so $I(t, v) \subseteq U_t$.

We model the resource limitedness of the observer through a constant, user-specified parameter $c \in \mathbb{N}$,
representing the maximum number of nodes that can be probed at any time, where probing a node $v$ returns
the value $I(t, v)$.

The observer chooses the c nodes to probe by following a schedule. In this work we focus on memoryless 
schedules, i.e., the choice of nodes to probe at time t is independent from the choice of nodes probed at 
any time $t' < t$. More precisely, a probing $c$-schedule $p$ is a probability distribution on $V$. At each time $t$,
the observer chooses a set $P_t$ of $c$ nodes to probe, such that $P_t$ is obtained through random sampling of $V$
without replacement according to $p$, independently of $P_{t'}$ for $t' < t$.
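One possible reading of "sampling without replacement according to $p$" is successive sampling: draw a node with probability proportional to its weight, remove it, and repeat. The sketch below implements that reading in Python; the name `sample_probe_set` and the successive-sampling interpretation are assumptions of this example, not something the text specifies.

```python
import random

def sample_probe_set(p, c, rng):
    """Draw c distinct nodes by repeated weighted draws, removing each pick.

    Assumes at least c nodes have positive probability under p.
    """
    nodes = list(range(len(p)))
    weights = list(p)
    chosen = []
    for _ in range(c):
        idx = rng.choices(range(len(nodes)), weights=weights, k=1)[0]
        chosen.append(nodes.pop(idx))
        weights.pop(idx)
    return chosen

rng = random.Random(42)
print(sample_probe_set([0.5, 0.3, 0.2], c=2, rng=rng))
```

Each call is independent of previous calls, which matches the memoryless property of the schedule.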

Caught items, uncaught items, and novelty. We say that an item $(t', S)$ is caught at time $t \ge t'$ iff

1. a node $v \in S$ is probed at time $t$; and

2. no node in $S$ was probed in the interval $[t', t - 1]$.

Let $C_t$ be the set of items caught by the observer at any time $t' < t$. We have $C_t \subseteq U_t$. Let $N_t = U_t \setminus C_t$
be the set of uncaught items at time $t$, i.e., items that were generated at any time $t' \le t$ and have not been
caught yet at time $t$. For any item $(t', S) \in N_t$, we define the $\theta$-novelty of $(t', S)$ at time $t$ as

\[ f_\theta(t, t', S) := \theta^{t - t'} , \]

where $\theta \in (0,1)$ is a user-specified parameter modeling how fast the value of an item decreases with time if
uncaught. Intuitively, pieces of information (e.g., rumors) have high value if caught almost as soon as they
have appeared in the network, but their value decreases fast (i.e., exponentially) as more time passes before
being caught, to the point of having no value in the limit.

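Load of the system and cost of a schedule. The set $N_t$ of uncaught items at time $t$ imposes a $\theta$-load,
$L_\theta(t)$, on the graph at time $t$, defined as the sum of the $\theta$-novelty at time $t$ of the items in $N_t$:

\[ L_\theta(t) := \sum_{(t', S) \in N_t} f_\theta(t, t', S) . \]

The quantity $L_\theta(t)$ is a random variable, depending both on $\Gamma$ and on the probing schedule $p$, and as such
it has an expectation $\mathbb{E}[L_\theta(t)]$ w.r.t. all the randomness in the system. The $\theta$-cost of a schedule $p$ is defined
as the limit, for $t \to \infty$, of the average expected load of the system:

\[ \mathrm{cost}_\theta(p) := \lim_{t \to \infty} \frac{1}{t} \sum_{t' \le t} \mathbb{E}[L_\theta(t')] = \lim_{t \to \infty} \frac{1}{t} \sum_{t' \le t} \mathbb{E}\Big[ \sum_{(t'', S) \in N_{t'}} f_\theta(t', t'', S) \Big] . \]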

^2 The set $S$ appears in the notation for an item only for clarity of presentation: we are not assuming that when we probe a
node and find an item $(t, S)$ we obtain information about $S$.
Intuitively, the load at each time indicates the amount of novelty we did not catch at that time, and the cost
function measures the average of such loss over time. The limit above always exists (Lemma 1).
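The novelty and load definitions can be checked on a small worked example. The Python sketch below computes the load of a hand-picked set of uncaught items; the function names and the toy items are assumptions of this example.

```python
def novelty(t, t_gen, theta):
    """theta-novelty at time t of an item generated at time t_gen <= t."""
    return theta ** (t - t_gen)

def load(uncaught, t, theta):
    """Sum of the novelties at time t of the uncaught items (t_gen, S)."""
    return sum(novelty(t, t_gen, theta) for t_gen, _ in uncaught)

# Three uncaught items generated at times 0, 1, and 2, observed at t = 2.
uncaught = [(0, frozenset({0})), (1, frozenset({1, 2})), (2, frozenset({2}))]
print(load(uncaught, t=2, theta=0.5))  # 0.25 + 0.5 + 1.0 = 1.75
```

The oldest item contributes the least, which is the sense in which the cost penalizes recent uncaught items most heavily.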

We now have all the necessary ingredients to formally define the problem of interest in this work. 
Problem definition. Let $G = (V, E)$ be a graph and $\Gamma = (\mathcal{F}, \pi)$ be a generating process on $G$. Let $c \in \mathbb{N}$
and $\theta \in (0,1)$. The $(\theta, c)$-Optimal Probing Schedule Problem ($(\theta, c)$-OPSP) asks for the optimal
$c$-schedule $p^*$, i.e., the schedule with minimum $\theta$-cost over the set $\mathcal{S}_c$ of $c$-schedules:

\[ p^* = \arg\min_{p} \{ \mathrm{cost}_\theta(p) : p \in \mathcal{S}_c \} . \]

Thus, the goal is to design a $c$-schedule that discovers the maximum number of items weighted by their
novelty value (so that the most recently generated items weigh the most). The parameter $\theta$ controls how fast the
novelty of an item decays, and influences the choices of a schedule. When $\theta$ is close to 0, items are relevant
only for a few steps and the schedule must focus on the most recently generated items, catching them as soon
as they are generated (or at most shortly thereafter). At the other extreme ($\theta \approx 1$), an optimal schedule
must maximize the total number of discovered items, as their novelty decays very slowly.

Viewing the items as “information” disseminated in the network, an ideal schedule assigns higher probing 
probability to nodes that act as information hubs, i.e., nodes that receive a large number of items. Thus, an 
optimal schedule $p^*$ identifies information hubs among the nodes. This task (finding information hubs) can
be seen as the complement of the influence maximization problem. In the influence maximization
problem we look for a set of nodes that generate information that reaches the most nodes. In the information hubs
problem, we are interested in a set of nodes that receive the most information, and are thus the most informative
nodes for an observer.

In the following sections, we may drop the specification of the parameters from $\theta$-novelty, $\theta$-cost, $\theta$-load,
and c-schedule, and from their respective notation, as the parameters will be clear from the context. 

3 Related Work 

The novel problem we focus on in this work generalizes and complements a number of problems studied in 
the literature. 

The “Battle of Water Sensor Network” challenge motivated a number of works on outbreak detection:
the goal is to optimally place static or moving sensors in water networks to detect contamination.
The optimization can be done w.r.t. a number of objectives, such as maximizing the probability of detection,
minimizing the detection time, or minimizing the size of the subnetwork affected by the phenomena.
A related work considered sensors that are sent along fixed paths in the network with the goal of gathering
sufficient information to locate possible contaminations. Early detection of contagious outbreaks by
monitoring the neighborhood (friends) of a randomly chosen node (individual) was studied by Christakis
and Fowler [7]. Krause et al. present efficient schedules for minimizing energy consumption in battery-operated
sensors, while other works analyzed distributed solutions with limited communication capacities
and costs. In contrast, our work is geared to detection in huge but virtual networks such as the
Web or social networks embedded in the Internet, where it is possible to “sense” or probe (almost) any node
at approximately the same cost. Still, only a restricted number of nodes can be probed at each step, but the
optimization of the probing sequence is over a much larger domain, and the goal is to identify the outbreaks
(items) regardless of their size and solely by considering their interest value.

Our methods complement the work on Emerging Topic Detection where the goal is to identify emergent 
topics in a social network, assuming full access to the stream of all postings. Providers, such as Twitter or 
Facebook, have immediate access to all tweets or postings as they are submitted to their servers.
Outside observers need an efficient mechanism to monitor changes, such as the methods developed in this 
work. 

Web-crawling is another research area that studies how to obtain the most recent snapshots of the web.
However, it differs from our model in two key points: our model allows items to propagate their copies, and
they will be caught if any of their copies is discovered (whereas snapshots of a webpage belong to that page
only), and all the generated items should be discovered (and not just the recent ones).

The goal of the News and Feed Aggregation problem is to capture updates in news websites (e.g., by RSS
feeds). Our model differs from that setting in that we consider copies of the same news in
different web sites as equivalent and therefore are only interested in discovering one of the copies.


4 The WIGGINS Algorithm 

In this section we present the algorithm WIGGINS (and its variants) for solving the Optimal Probing Schedule
Problem $(\theta, c)$-OPSP for a generating process $\Gamma = (\mathcal{F}, \pi)$ on a graph $G = (V, E)$.

We start by assuming that we have complete knowledge of $\Gamma$, i.e., we know $\mathcal{F}$ and $\pi$. This strong
assumption allows us to study the theoretical properties of the cost function and motivates the design of
our algorithm, WIGGINS, to compute an optimal schedule. We then remove the assumption and show how
we can extend WIGGINS to only use a collection of observations from $\Gamma$. Then we discuss how to recalibrate
our algorithms when the parameters of the process (e.g., $\pi$ or $\mathcal{F}$) change over time. Finally, we show an
algorithm for the MapReduce framework that allows us to scale to large networks.


4.1 Computing the Optimal Schedule 

We first conduct a theoretical analysis of the cost function $\mathrm{cost}_\theta$.

Analysis of the cost function. Assume for now that we know $\Gamma$, i.e., we have complete knowledge of $\mathcal{F}$
and $\pi$. Under this assumption, we can exactly compute the $\theta$-cost of a $c$-schedule.

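Lemma 1. Let $p = (p_1, \dots, p_n)$ be a $c$-schedule. Then

\[ \mathrm{cost}_\theta(p) := \lim_{t \to \infty} \frac{1}{t} \sum_{t'=0}^{t} \mathbb{E}[L_\theta(t')] = \sum_{S \in \mathcal{F}} \frac{\pi(S)}{1 - \theta (1 - p(S))^c} , \qquad (1) \]

where $p(S) = \sum_{v \in S} p_v$.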

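Proof. Let $t$ be a time step, and consider the quantity $\mathbb{E}[L_\theta(t)]$. By definition we have

\[ \mathbb{E}[L_\theta(t)] = \mathbb{E}\Big[ \sum_{(t', S) \in N_t} f_\theta(t, t', S) \Big] = \mathbb{E}\Big[ \sum_{(t', S) \in N_t} \theta^{t - t'} \Big] , \]

where $N_t$ is the set of uncaught items at time $t$. Let now, for any $t' \le t$, $N_{t,t'} \subseteq N_t$ be the set of uncaught
items of the form $(t', S)$. Then we can write

\[ \mathbb{E}[L_\theta(t)] = \mathbb{E}\Big[ \sum_{t'=0}^{t} \sum_{(t', S) \in N_{t,t'}} \theta^{t - t'} \Big] . \]

Define now, for each $S \in \mathcal{F}$, the random variable $X_{S,t,t'}$ which takes the value $\theta^{t - t'}$ if $(t', S) \in N_{t,t'}$ and 0
otherwise. Using the linearity of expectation, we can write:

\[ \mathbb{E}[L_\theta(t)] = \sum_{S \in \mathcal{F}} \sum_{t'=0}^{t} \mathbb{E}[X_{S,t,t'}] . \qquad (2) \]

The r.v. $X_{S,t,t'}$ takes value $\theta^{t - t'}$ if and only if the following two events $E_1$ and $E_2$ both take place:

• $E_1$: the set $S \in \mathcal{F}$ belongs to $\mathcal{I}_{t'}$, i.e., is generated by $\Gamma$ at time $t'$;

• $E_2$: the item $(t', S)$ is uncaught at time $t$. This is equivalent to saying that no node $v \in S$ was probed in
the time interval $[t', t - 1]$.

We have $\Pr(E_1) = \pi(S)$, and

\[ \Pr(E_2) = (1 - p(S))^{c(t - t')} . \]

The events $E_1$ and $E_2$ are independent, as the process of probing the nodes is independent from the process
of generating items. Therefore, we have

\[ \Pr(X_{S,t,t'} = \theta^{t - t'}) = \Pr(E_1) \Pr(E_2) = \pi(S) (1 - p(S))^{c(t - t')} . \]

We can plug this quantity in the rightmost term of (2) and write

\[ \lim_{t \to \infty} \mathbb{E}[L_\theta(t)] = \lim_{t \to \infty} \sum_{S \in \mathcal{F}} \sum_{t'=0}^{t} \theta^{t - t'} \pi(S) (1 - p(S))^{c(t - t')} = \sum_{S \in \mathcal{F}} \frac{\pi(S)}{1 - \theta (1 - p(S))^c} , \qquad (3) \]

where we used the fact that $\theta (1 - p(S))^c < 1$, so that each inner sum is a convergent geometric series. We just showed
that the sequence $(\mathbb{E}[L_\theta(t)])_{t \in \mathbb{N}}$ converges as $t \to \infty$. Therefore, its Cesàro mean, i.e.,
$\lim_{t \to \infty} \frac{1}{t} \sum_{t'=0}^{t} \mathbb{E}[L_\theta(t')]$, equals its limit, and we have

\[ \mathrm{cost}_\theta(p) = \lim_{t \to \infty} \frac{1}{t} \sum_{t'=0}^{t} \mathbb{E}[L_\theta(t')] = \lim_{t \to \infty} \mathbb{E}[L_\theta(t)] = \sum_{S \in \mathcal{F}} \frac{\pi(S)}{1 - \theta (1 - p(S))^c} . \]

□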
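The closed form of Lemma 1 is straightforward to evaluate. The following Python sketch is one way to do so, assuming $\pi$ is given as a dictionary from sets of nodes to probabilities; the function name `cost` and the toy instance are assumptions of this example.

```python
def cost(p, pi, c, theta):
    """Closed-form cost: sum over S of pi(S) / (1 - theta * (1 - p(S))**c)."""
    total = 0.0
    for S, prob in pi.items():
        p_S = sum(p[v] for v in S)  # probability mass the schedule puts on S
        total += prob / (1.0 - theta * (1.0 - p_S) ** c)
    return total

# One set S = {0} generated at every step (pi(S) = 1), with c = 1 and theta = 0.5.
pi = {frozenset({0}): 1.0}
print(cost([1.0, 0.0], pi, c=1, theta=0.5))  # always probing node 0: cost 1.0
print(cost([0.5, 0.5], pi, c=1, theta=0.5))  # uniform schedule: 1 / 0.75
```

In this toy instance the cost can never drop below 1, because the item generated at the current step always contributes novelty $\theta^0 = 1$ to the load.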


We now show that $\mathrm{cost}_\theta(p)$, as expressed by the r.h.s. of (1), is a convex function over its domain $\mathcal{S}_c$, the
set of all possible $c$-schedules. We then use this result to show how to compute an optimal schedule.

Theorem 1. The cost function $\mathrm{cost}_\theta(p)$ is a convex function over $\mathcal{S}_c$.

Proof. For any $S \in \mathcal{F}$, let

\[ f_S(p) = \frac{1}{1 - \theta (1 - p(S))^c} . \]

The function $\mathrm{cost}_\theta(p)$ is a linear combination of $f_S(p)$'s with positive coefficients. Hence to show that $\mathrm{cost}_\theta(p)$
is convex it is sufficient to show that, for any $S \in \mathcal{F}$, $f_S(p)$ is convex.

We start by showing that $g_S(p) = \theta (1 - p(S))^c$ is convex. This is due to the fact that its Hessian matrix
is positive semidefinite:

\[ \frac{\partial^2 g_S}{\partial p_i \, \partial p_j}(p) = \begin{cases} \theta c (c - 1) (1 - p(S))^{c - 2} & \text{if } i, j \in S \\ 0 & \text{otherwise.} \end{cases} \]

Let $v_S$ be an $n \times 1$ vector in $\mathbb{R}^n$ such that its $i$-th coordinate is $[\theta c (c - 1)(1 - p(S))^{c - 2}]^{1/2}$ if $i \in S$, and 0
otherwise. We can write the Hessian matrix of $g_S$ as

\[ \nabla^2 g_S = v_S \, v_S^{\mathsf{T}} , \]

and thus $\nabla^2 g_S$ is a positive semidefinite matrix and $g_S$ is convex. From here, we have that $1 - g_S$ is a concave
function. Since $f_S(p) = \frac{1}{1 - g_S(p)}$ and the function $h(x) = \frac{1}{x}$ is convex and non-increasing for $x > 0$, $f_S$ is a convex
function. □
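Convexity can be sanity-checked numerically with the midpoint inequality $\mathrm{cost}_\theta((p+q)/2) \le (\mathrm{cost}_\theta(p) + \mathrm{cost}_\theta(q))/2$ on random pairs of schedules. The sketch below is only an empirical check on a hypothetical toy instance, not a substitute for the proof; all names and values are assumptions of this example.

```python
import random

def cost(p, pi, c, theta):
    """Closed-form cost: sum over S of pi(S) / (1 - theta * (1 - p(S))**c)."""
    return sum(prob / (1.0 - theta * (1.0 - sum(p[v] for v in S)) ** c)
               for S, prob in pi.items())

def random_schedule(n, rng):
    """A random point of the probability simplex (normalized positive weights)."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

pi = {frozenset({0}): 0.5, frozenset({0, 1}): 0.3, frozenset({2}): 0.1}
c, theta = 2, 0.9
rng = random.Random(0)

violations = 0
for _ in range(1000):
    p, q = random_schedule(3, rng), random_schedule(3, rng)
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    lhs = cost(mid, pi, c, theta)
    rhs = (cost(p, pi, c, theta) + cost(q, pi, c, theta)) / 2
    if lhs > rhs + 1e-12:  # small slack for floating-point rounding
        violations += 1
print(violations)
```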
If for every $v \in V$, $S = \{v\}$ belongs to $\mathcal{F}$, then the function $g_S$ in the above proof is strictly convex, and
so is $f_S$.

We then have the following corollary of Thm. 1.

Corollary 1. Any schedule $p$ with locally minimum cost is an optimal schedule (i.e., it has global minimum
cost). Furthermore, if for every $v \in V$, $\{v\}$ belongs to $\mathcal{F}$, the optimal schedule is unique.


The algorithm. Corollary 1 implies that one can compute an optimal $c$-schedule $p^*$ (i.e., solve the $(\theta, c)$-OPSP)
by solving the unconstrained minimization of $\mathrm{cost}_\theta$ over the set $\mathcal{S}_c$ of all $c$-schedules, or equivalently
by solving the following constrained minimization problem on $\mathbb{R}^n$:

\[ \begin{aligned} \min_{p \in \mathbb{R}^n} \quad & \mathrm{cost}_\theta(p) \\ \text{s.t.} \quad & \sum_{i=1}^{n} p_i = 1 \\ & p_i \ge 0 \quad \forall i \in \{1, \dots, n\} \end{aligned} \qquad (4) \]

Since the function $\mathrm{cost}_\theta$ is convex and the constraints are linear, the optimal solution can, theoretically, be
found efficiently [5]. In practice though, available convex optimization solvers do not scale well
with the number $n$ of variables, especially when $n$ is in the millions, as is the case for modern graphs like
online social networks or the Web. Hence we developed WIGGINS, an iterative method based on Lagrange
multipliers, which can scale efficiently and can be adapted to the MapReduce framework of
computation [3], as we show in Sect. 4.4. While we cannot prove that this iterative method always converges,
we can prove (Thm. 2) that (i) if at any iteration the algorithm examines an optimal schedule, then it will
reach convergence at the next iteration, and (ii) if it converges to a schedule, that schedule is optimal. In
Sect. 5 we show our experimental results illustrating the convergence of WIGGINS in different cases.

WIGGINS takes as inputs the collection $\mathcal{F}$, the function $\pi$, and the parameters $c$ and $\theta$, and outputs a
schedule $p$ which, if convergence (defined in the following) has been reached, is the optimal schedule. It starts
from a uniform schedule $p^{(0)}$, i.e., $p_i^{(0)} = 1/n$ for all $1 \le i \le n$, and iteratively refines it until convergence (or
until a user-specified maximum number of iterations have been performed). At iteration $j \ge 1$, we compute,
for each value $i$, $1 \le i \le n$, the function

\[ W_i(p^{(j-1)}) := \sum_{S \in \mathcal{F} \text{ s.t. } i \in S} \frac{\theta c \, \pi(S) (1 - p^{(j-1)}(S))^{c-1}}{(1 - \theta (1 - p^{(j-1)}(S))^c)^2} \qquad (5) \]

and then set

\[ p_i^{(j)} = \frac{p_i^{(j-1)} W_i(p^{(j-1)})}{\sum_{z=1}^{n} p_z^{(j-1)} W_z(p^{(j-1)})} . \]

The algorithm then checks whether $p^{(j)} = p^{(j-1)}$. If so, then we reached convergence and we can return $p^{(j)}$
in output, otherwise we perform iteration $j + 1$. The pseudocode for WIGGINS is in Algorithm 1. The
following theorem shows the correctness of the algorithm in case of convergence.


Theorem 2. We have that: 


1. if at any iteration $j$ the schedule $p^{(j)}$ is optimal, then WIGGINS reaches convergence at iteration $j + 1$;
and 

2. if WIGGINS reaches convergence, then the returned schedule p is optimal. 

Proof. From the method of the Lagrange multipliers, we have that, if a schedule $p$ is optimal,
then there exists a value $\lambda \in \mathbb{R}$ such that $p$ and $\lambda$ form a solution to the following system of $n + 1$ equations
in $n + 1$ unknowns:

\[ \nabla [ \mathrm{cost}_\theta(p) + \lambda (p_1 + \dots + p_n - 1) ] = 0 , \qquad (6) \]

where the gradient on the l.h.s. is taken w.r.t. (the components of) $p$ and to $\lambda$ (i.e., has $n + 1$ components).
For $1 \le i \le n$, the $i$-th equation induced by (6) is

\[ \frac{\partial}{\partial p_i} \mathrm{cost}_\theta(p) + \lambda = 0 , \]

or, equivalently,

\[ \sum_{S \in \mathcal{F} \text{ s.t. } i \in S} \frac{\theta c \, \pi(S) (1 - p(S))^{c-1}}{(1 - \theta (1 - p(S))^c)^2} = \lambda . \qquad (7) \]

The term on the l.h.s. is exactly $W_i(p)$. The $(n + 1)$-th equation of the system (6) (i.e., the one involving
the partial derivative w.r.t. $\lambda$) is

\[ \sum_{z=1}^{n} p_z = 1 . \qquad (8) \]

Consider now the first claim of the theorem, and assume that we are at iteration $j$ such that $j$ is the
minimum iteration index for which the schedule $p^{(j)}$ computed at the end of iteration $j$ is optimal. Then,
for any $i$, $1 \le i \le n$, we have

\[ W_i(p^{(j)}) = \lambda \]

because $p^{(j)}$ is optimal and hence all identities in the form of (7) must be true. For the same reason, (8)
must also hold for $p^{(j)}$. Hence, for any $1 \le i \le n$, we can write the value $p_i^{(j+1)}$ computed at the end of
iteration $j + 1$ as

\[ p_i^{(j+1)} = \frac{p_i^{(j)} W_i(p^{(j)})}{\sum_{z=1}^{n} p_z^{(j)} W_z(p^{(j)})} = \frac{p_i^{(j)} \lambda}{1 \cdot \lambda} = p_i^{(j)} , \]

which means that we reached convergence and WIGGINS will return $p^{(j+1)}$, which is optimal.

Consider the second claim of the theorem, and let $j$ be the first iteration for which $p^{(j)} = p^{(j-1)}$. Then
we have, for any $1 \le i \le n$,

\[ p_i^{(j)} = \frac{p_i^{(j-1)} W_i(p^{(j-1)})}{\sum_{z=1}^{n} p_z^{(j-1)} W_z(p^{(j-1)})} = p_i^{(j-1)} . \]

This implies

\[ W_i(p^{(j-1)}) = \sum_{z=1}^{n} p_z^{(j-1)} W_z(p^{(j-1)}) \qquad (9) \]

and the r.h.s. does not depend on $i$, and so neither does $W_i(p^{(j-1)})$. Hence we have $W_1(p^{(j-1)}) = \dots =
W_n(p^{(j-1)})$ and can rewrite (9) as

\[ W_1(p^{(j-1)}) = \sum_{z=1}^{n} p_z^{(j-1)} W_1(p^{(j-1)}) , \]

which implies that the identity (8) holds for $p^{(j-1)}$. Moreover, if we set

\[ \lambda = W_1(p^{(j-1)}) \]

we have that all the identities in the form of (7) hold. Then, $p^{(j-1)}$ and $\lambda$ form a solution to the system (6),
which implies that $p^{(j-1)}$ is optimal and so must be $p^{(j)}$, the returned schedule, as it is equal to $p^{(j-1)}$
because WIGGINS reached convergence. □






Algorithm 1: WIGGINS

input : $\mathcal{F}$, $\pi$, $c$, $\theta$, and maximum number $T$ of iterations
output: A $c$-schedule $p$ (with globally minimum $\theta$-cost, in case of convergence)

 1 for i ← 1 to n do
 2     p_i ← 1/n
 3 end
 4 for j ← 1 to T do
 5     for i ← 1 to n do
 6         W_i ← 0
 7     end
 8     for S ∈ $\mathcal{F}$ do
 9         for i ∈ S do
10             W_i ← W_i + $\frac{\theta c \, \pi(S) (1 - p(S))^{c-1}}{(1 - \theta (1 - p(S))^c)^2}$
11         end
12     end
13     p_old ← p
14     for i ← 1 to n do
15         p_i ← $p^{old}_i W_i / \sum_{z=1}^{n} p^{old}_z W_z$
16     end
17     if p_old = p then // test for convergence
18         break
19     end
20 end
21 return p
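The pseudocode translates almost line by line into Python. The sketch below is a minimal, unoptimized rendering under two assumptions that are not in the paper: $\pi$ is passed as a dictionary from frozensets of nodes to probabilities, and convergence is tested with a small tolerance `tol` instead of exact equality, since floating-point iterates rarely coincide exactly.

```python
def wiggins(n, pi, c, theta, max_iters=1000, tol=1e-12):
    """Iterate p_i <- p_i * W_i / sum_z p_z * W_z, starting from the uniform schedule."""
    p = [1.0 / n] * n
    for _ in range(max_iters):
        W = [0.0] * n
        for S, prob in pi.items():
            p_S = sum(p[v] for v in S)
            term = (theta * c * prob * (1.0 - p_S) ** (c - 1)
                    / (1.0 - theta * (1.0 - p_S) ** c) ** 2)
            for i in S:
                W[i] += term
        norm = sum(p_i * W_i for p_i, W_i in zip(p, W))
        p_old = p
        p = [p_i * W_i / norm for p_i, W_i in zip(p_old, W)]
        # Tolerance-based convergence test (assumption of this sketch).
        if max(abs(a - b) for a, b in zip(p, p_old)) <= tol:
            break
    return p

# Two singleton sets with pi({0}) = 0.6 and pi({1}) = 0.4, c = 1, theta = 0.5.
pi = {frozenset({0}): 0.6, frozenset({1}): 0.4}
p = wiggins(2, pi, c=1, theta=0.5)
print(p)
```

For this toy instance the stationarity condition $W_0(p) = W_1(p)$ can be solved by hand: it gives $(1 + p_1)/(1 + p_0) = \sqrt{2/3}$, i.e., $p_0 = (2 - \sqrt{2/3})/(1 + \sqrt{2/3}) \approx 0.6515$, which the iteration approaches.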


4.2 Approximation through Sampling 

We now remove the assumption, not realistic in practice, of knowing the generating process $\Gamma$ exactly through
$\mathcal{F}$ and $\pi$. Instead, we observe the process using, for a limited time interval, a schedule that iterates over all
nodes (or a schedule that selects each node with uniform probability), until we have observed, for each time
step $t$ in a limited time interval $[a, b]$, the set $\mathcal{I}_t$ generated by $\Gamma$, and therefore we have access to a collection

\[ \mathcal{I} = \{ \mathcal{I}_a, \mathcal{I}_{a+1}, \dots, \mathcal{I}_b \} . \qquad (10) \]

We refer to $\mathcal{I}$ as a sample gathered in the time interval $[a, b]$. We show that a schedule computed with
respect to a sample $\mathcal{I}$ taken during an interval of $\ell(\mathcal{I}) = b - a = O(\varepsilon^{-2} \log n)$ steps has a cost within
a multiplicative factor $(1 + \varepsilon)/(1 - \varepsilon)$, for $\varepsilon \in (0, 1)$, of the cost of the optimal schedule. We then adapt
WIGGINS to optimize with respect to such a sample.

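We start by defining the cost of a schedule w.r.t. a sample $\mathcal{I}$.

Definition 1. Suppose $p$ is a $c$-schedule and $\mathcal{I}$ is as in Equation (10), with $\ell(\mathcal{I}) = b - a$. The $\theta$-cost of $p$
w.r.t. $\mathcal{I}$, denoted by $\mathrm{cost}_\theta(p, \mathcal{I})$, is defined as

\[ \mathrm{cost}_\theta(p, \mathcal{I}) := \frac{1}{\ell(\mathcal{I})} \sum_{\mathcal{I}_t \in \mathcal{I}} \sum_{S \in \mathcal{I}_t} \frac{1}{1 - \theta (1 - p(S))^c} . \]

For $1 \le i \le n$, define now the functions

\[ W_i(p, \mathcal{I}) := \frac{1}{\ell(\mathcal{I})} \sum_{\mathcal{I}_t \in \mathcal{I}} \sum_{S \in \mathcal{I}_t \text{ s.t. } i \in S} \frac{\theta c (1 - p(S))^{c-1}}{(1 - \theta (1 - p(S))^c)^2} . \]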
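The sample-based quantities replace $\pi(S)$ with empirical frequencies. The Python sketch below computes both, assuming the sample is a list whose $k$-th entry is the list of sets observed at the $k$-th step, and using the number of observed steps as the normalizing length (the text defines the length as $b - a$; the choice here is an assumption of this example, and it only rescales both quantities by a constant).

```python
def sample_cost(p, sample, c, theta):
    """Empirical cost of schedule p w.r.t. a sample (list of per-step lists of sets)."""
    total = 0.0
    for observed in sample:
        for S in observed:
            p_S = sum(p[v] for v in S)
            total += 1.0 / (1.0 - theta * (1.0 - p_S) ** c)
    return total / len(sample)

def sample_weights(p, sample, c, theta):
    """Empirical counterpart of the W_i functions, one value per node."""
    W = [0.0] * len(p)
    for observed in sample:
        for S in observed:
            p_S = sum(p[v] for v in S)
            term = (theta * c * (1.0 - p_S) ** (c - 1)
                    / (1.0 - theta * (1.0 - p_S) ** c) ** 2)
            for i in S:
                W[i] += term
    return [w / len(sample) for w in W]

# Two observed steps: the set {0} appeared in the first, nothing in the second.
sample = [[frozenset({0})], []]
print(sample_cost([1.0, 0.0], sample, c=1, theta=0.5))  # (1 / (1 - 0)) / 2 = 0.5
print(sample_weights([0.5, 0.5], sample, c=1, theta=0.5))
```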
We can then define a variant of WIGGINS, which we call WIGGINS-APX. The differences from WIGGINS
are:

1. the loop on line 8 in Alg. 1 is only over the sets that appear in at least one $\mathcal{I}_t \in \mathcal{I}$;

2. WIGGINS-APX uses the values $W_i(p, \mathcal{I})$ (defined above) instead of $W_i(p)$ (line 10 in Alg. 1).

If WIGGINS-APX reaches convergence, it returns a schedule with the minimum cost w.r.t. the sample $\mathcal{I}$.
More formally, by following the same steps as in the proof of Thm. 2, we can prove the following result about
WIGGINS-APX.


Lemma 2. We have that: 

1. if at any iteration $j$ the schedule $p^{(j)}$ has minimum cost w.r.t. $\mathcal{I}$, then WIGGINS-APX reaches convergence
at iteration $j + 1$; and

2. if WIGGINS-APX reaches convergence, then the returned schedule $p$ has minimum cost w.r.t. $\mathcal{I}$.

Let $\ell(\mathcal{I})$ denote the length of the time interval during which $\mathcal{I}$ was collected. For a $c$-schedule $p$, $\mathrm{cost}_\theta(p, \mathcal{I})$
is an approximation of $\mathrm{cost}_\theta(p)$, and intuitively the larger $\ell(\mathcal{I})$, the better the approximation.

We now show that, if $\ell(\mathcal{I})$ is large enough, then, with high probability (i.e., with probability at least
$1 - 1/n^r$ for some constant $r$), the schedule $p$ returned by WIGGINS-APX in case of convergence has a cost
$\mathrm{cost}_\theta(p)$ that is close to the cost $\mathrm{cost}_\theta(p^*)$ of an optimal schedule $p^*$.


Theorem 3. Let $r$ be a positive integer, and let $\mathcal{I}$ be a sample gathered during a time interval of length

\[ \ell(\mathcal{I}) \ge \frac{3 (r \ln(n) + \ln(4))}{\varepsilon^2 (1 - \theta)} . \qquad (11) \]

Let $p^*$ be an optimal schedule, i.e., a schedule with minimum cost. If WIGGINS-APX converges, then the
returned schedule $p$ is such that

\[ \mathrm{cost}_\theta(p^*) \le \mathrm{cost}_\theta(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \, \mathrm{cost}_\theta(p^*) . \]

To prove Thm. 3, we need the following technical lemma.

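Lemma 3. Let $p$ be a $c$-schedule and $\mathcal{I}$ be a sample gathered during a time interval of length

\[ \ell(\mathcal{I}) \ge \frac{3 (r \ln(n) + \ln(2))}{\varepsilon^2 (1 - \theta)} , \qquad (12) \]

where $r$ is any natural number. Then, for every schedule $p$ we have

\[ \Pr( | \mathrm{cost}_\theta(p, \mathcal{I}) - \mathrm{cost}_\theta(p) | \ge \varepsilon \cdot \mathrm{cost}_\theta(p) ) \le \frac{1}{n^r} . \]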


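Proof. For any $S \in \mathcal{F}$, let $X_S$ be a random variable which is $\frac{1}{1 - \theta (1 - p(S))^c}$ with probability $\pi(S)$, and zero
otherwise. Since $p(S) \in [0, 1]$, we have

\[ 1 \le \frac{1}{1 - \theta (1 - p(S))^c} \le \frac{1}{1 - \theta} . \]

If we let $X = \sum_{S \in \mathcal{F}} X_S$, then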


costs(p) = E[X] = E ^ E 


(13) 


Let Z = ^('^)- Then we have 


Z < A < 


1-0 


10 









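Let $X_S^i$ be the $i$-th draw of $X_S$ during the time interval in which $\mathcal{I}$ was sampled, and define $X^i = \sum_{S \in \mathcal{F}} X_S^i$.

We have

\[ \mathrm{cost}_\theta(p, \mathcal{I}) = \frac{1}{\ell(\mathcal{I})} \sum_{i} X^i . \]

Let now

\[ \mu = \frac{\ell(\mathcal{I}) (1 - \theta) \, \mathrm{cost}_\theta(p)}{|Z|} . \]

By using the Chernoff bound for Binomial random variables, we have

\[ \begin{aligned} \Pr( | \mathrm{cost}_\theta(p, \mathcal{I}) - \mathrm{cost}_\theta(p) | \ge \varepsilon \, \mathrm{cost}_\theta(p) ) &= \Pr\Big( \Big| \sum_{i} X^i - \ell(\mathcal{I}) \, \mathrm{cost}_\theta(p) \Big| \ge \varepsilon \, \ell(\mathcal{I}) \, \mathrm{cost}_\theta(p) \Big) \\ &= \Pr\Big( \Big| \frac{1 - \theta}{|Z|} \sum_{i} X^i - \mu \Big| \ge \varepsilon \mu \Big) \le 2 \exp\Big( - \frac{\varepsilon^2 \ell(\mathcal{I}) (1 - \theta) \, \mathrm{cost}_\theta(p)}{3 |Z|} \Big) \\ &\le 2 \exp\Big( - \frac{\varepsilon^2 \ell(\mathcal{I}) (1 - \theta)}{3} \Big) , \end{aligned} \]

where the last inequality follows from the rightmost inequality in (13). The thesis follows from our choice of
$\ell(\mathcal{I})$. □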

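We can now prove Thm. 3.

Proof of Thm. 3. The leftmost inequality is immediate, so we focus on the one on the right. For our choice of $\ell(\mathcal{I})$
we have, through the union bound, that, with probability at least $1 - 1/n^r$, the following hold at the same time:

\[ \begin{aligned} (1 - \varepsilon) \, \mathrm{cost}_\theta(p) &\le \mathrm{cost}_\theta(p, \mathcal{I}) \le (1 + \varepsilon) \, \mathrm{cost}_\theta(p), \text{ and} \\ (1 - \varepsilon) \, \mathrm{cost}_\theta(p^*) &\le \mathrm{cost}_\theta(p^*, \mathcal{I}) \le (1 + \varepsilon) \, \mathrm{cost}_\theta(p^*) . \end{aligned} \qquad (14) \]

Since we assumed that WIGGINS-APX reached convergence when computing $p$, Lemma 2 holds, and $p$ is
a schedule with minimum cost w.r.t. $\mathcal{I}$. In particular, it must be

\[ \mathrm{cost}_\theta(p, \mathcal{I}) \le \mathrm{cost}_\theta(p^*, \mathcal{I}) . \]

From this and (14), we then have

\[ (1 - \varepsilon) \, \mathrm{cost}_\theta(p) \le \mathrm{cost}_\theta(p, \mathcal{I}) \le \mathrm{cost}_\theta(p^*, \mathcal{I}) \le (1 + \varepsilon) \, \mathrm{cost}_\theta(p^*) \]

and by comparing the leftmost and the rightmost terms we get the thesis.

□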
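To get a feeling for the sample sizes involved, the following sketch evaluates the smallest integer length satisfying a bound of the form $3(r \ln n + \ln 4) / (\varepsilon^2 (1 - \theta))$, which is the form that the final Chernoff expression in the proof of Lemma 3 requires. The function name and the numeric values are assumptions of this example.

```python
import math

def sample_length(n, r, eps, theta):
    """Smallest integer length meeting 3 * (r * ln(n) + ln(4)) / (eps^2 * (1 - theta))."""
    bound = 3.0 * (r * math.log(n) + math.log(4)) / (eps ** 2 * (1.0 - theta))
    return math.ceil(bound)

print(sample_length(n=1000, r=1, eps=0.5, theta=0.5))  # 200
```

The length grows only logarithmically with the number of nodes, but quadratically in $1/\varepsilon$ and linearly in $1/(1 - \theta)$, so slowly decaying novelty calls for longer observation windows.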


4.3 Dynamic Settings 

In this section we discuss how to handle changes in the parameters $\mathcal{F}$ and $\pi$ as the (unknown) generating
process $\Gamma$ evolves over time. The idea is to maintain an estimate $\tilde{\pi}(S)$ of $\pi(S)$ for each set $S \in \mathcal{F}$ that we
discover in the probing process, together with the last time $t$ such that an item $(t, S)$ has been generated
(and caught at a time $t' \ge t$). If we have not caught an item of the form $(t'', S)$ in an interval significantly
longer than $1/\tilde{\pi}(S)$, then we assume that the parameters of $\Gamma$ changed. Hence, we trigger the collection of
a new sample and compute a new schedule as described in Sect. 4.2.

Note that when we adapt our schedule to the new environment (using the most recent sample) the
system converges to its stable setting exponentially (in $\theta$) fast. Suppose $L$ items have been generated since
we detected the change in the parameters until we adopt the new schedule. These items, if not caught, lose
their novelty exponentially fast, since after $t$ steps their total novelty is at most $L \theta^t$ and decreases exponentially.
In our experiments (Sect. 5) we provide different examples that illustrate how the load of the generating
process becomes stable after the algorithm adapts itself to the changes of parameters.
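The change-detection rule can be sketched as a simple staleness test. In the Python sketch below, `slack` is a hypothetical multiplier standing in for "significantly longer", and the dictionaries `pi_hat` (current estimates) and `last_seen` (generation time of the last caught item per set) are assumed data structures; none of these names come from the paper.

```python
def stale_sets(pi_hat, last_seen, t, slack=5.0):
    """Sets whose last caught item is older than slack / pi_hat(S) steps at time t.

    Sets never seen are treated as last seen at time 0 (assumption of this sketch).
    """
    return [S for S, est in pi_hat.items()
            if est > 0 and t - last_seen.get(S, 0) > slack / est]

A, B = frozenset({0}), frozenset({1, 2})
pi_hat = {A: 0.5, B: 0.1}      # A is expected about every 2 steps, B about every 10
last_seen = {A: 10, B: 10}

print(stale_sets(pi_hat, last_seen, t=15))  # [] : both gaps are within the allowance
print(stale_sets(pi_hat, last_seen, t=25))  # only A: gap 15 > 5 / 0.5 = 10
```

A non-empty result would trigger the collection of a new sample and the computation of a new schedule.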
4.4 Scaling up with MapReduce

In this section, we discuss how to adapt WIGGINS-APX to the MapReduce framework [3]. We denote the 
resulting algorithm as WIGGINS-MR. 

In MapReduce, algorithms work in rounds. At each round, first a function map is executed independently 
(and therefore potentially massively in parallel) on each element of the input, and a number (or zero) key- 
value pairs of the form (k, v) are emitted. Then, in the second part of the round, the emitted pairs are 
partitioned by key and elements with the same key are sent to the same machine (called the reducer for that 
key), where a function reduce is applied to the whole set of received pairs, to emit the final output. 

Each iteration of WIGGINS-APX is spread over two rounds of WIGGINS-MR. At each round, we assume
that the current schedule $p$ is available to all machines (this is done in practice through a distributed cache).
In the first round, we compute the values $p_i W_i$, $1 \le i \le n$; in the second round these values are summed to
get the normalization factor and the schedule $p$ is updated. The input to the first round
consists of the sets $S \in \mathcal{I}$. The function map, applied to $S$, outputs a key-value pair $(i, v_S)$ for each $i \in S$, with

$$v_S = \frac{\theta c \, (1 - p(S))^{c-1}}{\left(1 - \theta (1 - p(S))^{c}\right)^{2}} .$$

The reducer for the key $i$ receives the pairs $(i, v_S)$ for each $S \in \mathcal{I}$ such that $i \in S$, and aggregates them to
output the pair $(i, g_i)$, with

$$g_i = p_i \sum_{S \in \mathcal{I} : i \in S} v_S = p_i W_i .$$

The set of pairs $(i, g_i)$, $1 \le i \le n$, constitutes the input to the next round. Each input pair is sent to the
same reducer,* which computes the value

$$g = \sum_{i=1}^{n} g_i = \sum_{i=1}^{n} p_i W_i$$

and uses it to obtain the new values $p_i = g_i / g$, for $1 \le i \le n$. The reducer then outputs $(i, p_i)$. At this
point, the new schedule is distributed to all machines again and a new iteration can start.

The same results we had for the quality of the final schedule computed by WIGGINS-APX in case of
convergence carry over to WIGGINS-MR.
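
The two rounds can be mimicked on a single machine to see how the pieces fit together. The sketch below is illustrative only and rests on assumptions not stated in this section: that $p(S)$ is the sum of $p_i$ over $i \in S$, and that $v_S = \theta c (1 - p(S))^{c-1} / (1 - \theta(1 - p(S))^c)^2$. All function names are hypothetical, and the "shuffle" is simulated with a dictionary.

```python
from collections import defaultdict


def set_probability(p, s):
    """p(S): total schedule mass on the nodes of S (assumed definition)."""
    return sum(p[i] for i in s)


def map_round1(s, p, theta, c):
    """Emit (i, v_S) for each node i in S; the v_S formula is an assumption."""
    q = 1.0 - set_probability(p, s)
    v = theta * c * q ** (c - 1) / (1.0 - theta * q ** c) ** 2
    return [(i, v) for i in s]


def reduce_round1(i, values, p):
    """Aggregate the v_S received for key i into g_i = p_i * W_i."""
    return i, p[i] * sum(values)


def reduce_round2(pairs):
    """Single reducer of the second round: normalize the g_i into the new schedule."""
    g = sum(g_i for _, g_i in pairs)
    return {i: g_i / g for i, g_i in pairs}


def wiggins_mr_iteration(sample, p, theta, c):
    """One simulated iteration: round 1 (map + reduce), then round 2 (normalize)."""
    grouped = defaultdict(list)  # simulated shuffle: group emitted pairs by key
    for s in sample:
        for i, v in map_round1(s, p, theta, c):
            grouped[i].append(v)
    # Nodes that appear in no set of the sample receive g_i = 0.
    pairs = [reduce_round1(i, grouped.get(i, []), p) for i in p]
    return reduce_round2(pairs)
```

For example, with the sample `[{0}, {0, 1}, {2}]`, a uniform schedule on three nodes, `theta = 0.75`, and `c = 1`, node 0 appears in two sets and ends up with half of the probing mass after one iteration.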


5 Experimental Results 

In this section we present the results of our experimental evaluation of WIGGINS-APX.

Goals. First, we show that for a given sample $\mathcal{I}$, WIGGINS-APX converges quickly to a schedule $p^*$ that minimizes
$\mathrm{cost}_\theta(p, \mathcal{I})$ (see Sect. 4.1). In particular, our experiments illustrate that the sequence
$\mathrm{cost}_\theta(p^1, \mathcal{I}), \mathrm{cost}_\theta(p^2, \mathcal{I}), \ldots$ is decreasing and converges after a few iterations.
Next, we compare the output schedule of WIGGINS-APX to
four other schedules: (i) the uniform schedule, (ii) proportional to out-degrees, (iii) proportional to in-degrees,
and (iv) proportional to undirected degrees, i.e., the number of incident edges. Specifically, we compute
the costs of these schedules according to a sample $\mathcal{I}$ that satisfies the condition of the sample-size lemma, and compare
them. Then, we consider a specific example for which we know the unique optimal schedule, and show that
for larger samples WIGGINS-APX outputs a schedule closer to the optimal. Finally, we demonstrate how our
method can adapt itself to changes in the network parameters.

Environment and Datasets. We implemented WIGGINS-APX in C++. The implementation of WIGGINS-APX
never loads the entire sample into main memory, which makes it very practical when using large
samples. The experiments were run on an Opteron 6282 SE CPU (2.6 GHz) with 12GB of RAM. We tested
our method on graphs from the SNAP repository† (see Table 1 for details). We always consider the graphs
to be directed, replacing undirected edges with two directed ones.

* This step can be made more scalable through combiners, an advanced MapReduce feature.

† http://snap.stanford.edu

Datasets        #nodes    #edges     $(|V_{1K}|, |V_{500}|, |V_{100}|)$   gen. rate
Enron-Email      36692     367662    (9, 23, 517)                          7.22
Brightkite       58228     428156    (2, 7, 399)                           4.54
web-Notredame   325729    1497134    (43, 80, 1619)                       24.49
web-Google      875713    5105039    (134, 180, 3546)                     57.86

Table 1: The datasets, corresponding statistics, and the rate of generating new items at each step.
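
The generation rate of each dataset can be recomputed from its class sizes and from the head probabilities of the three non-zero classes (0.1, 0.05, and 0.01, as listed in Table 2). The short sketch below does this arithmetic; the names `BIAS`, `CLASS_SIZES`, and `generation_rate` are introduced here for illustration.

```python
# Head probabilities of the classes V_1K, V_500, V_100 (Table 2).
BIAS = (0.1, 0.05, 0.01)

# Class sizes (|V_1K|, |V_500|, |V_100|) copied from Table 1.
CLASS_SIZES = {
    "Enron-Email": (9, 23, 517),
    "Brightkite": (2, 7, 399),
    "web-Notredame": (43, 80, 1619),
    "web-Google": (134, 180, 3546),
}


def generation_rate(dataset):
    """Expected number of coins with outcome head per time step."""
    return sum(size * b for size, b in zip(CLASS_SIZES[dataset], BIAS))


for name in CLASS_SIZES:
    print(name, round(generation_rate(name), 2))
```

For Enron-Email this gives 9(0.1) + 23(0.05) + 517(0.01) = 7.22, matching the rightmost column of Table 1.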


Generating process. The generating process $\Gamma = (\mathcal{F}, \pi)$ we use in our experiments (except those in
Sect. 5.1.1) simulates an Independent-Cascade (IC) model. Since explicitly computing $\pi(S)$ in this case
does not seem possible, we simulate the creation of items according to this model as follows. At each time
$t$, items are generated in two phases: a “creation” phase and a “diffusion” phase. In the creation phase, we
simulate the creation of “rumors” at the nodes: we flip a biased coin for each node in the graph, where the
bias depends on the out-degree of the node. We assume a partition of the nodes into classes based on their
out-degrees, and we assign the same head probability to the biased coins of nodes in the same class, as
shown in Table 2. In Table 1, for each dataset we report the size of the classes and the expected number
of flipped coins with outcome head at each time (rightmost column).

Class       Nodes in class                              Bias
$V_{1K}$    $\{i \in V : \deg^+(i) \ge 1000\}$          0.1
$V_{500}$   $\{i \in V : 500 \le \deg^+(i) < 1000\}$    0.05
$V_{100}$   $\{i \in V : 100 \le \deg^+(i) < 500\}$     0.01
$V_0$       $\{i \in V : \deg^+(i) < 100\}$             0.0

Table 2: Classes and bias for the generating process.

Let now $v$ be a node whose coin had outcome head in the most recent flip. In the “diffusion” phase we simulate
the spreading of the “rumor” originating at $v$ through the network according to the IC model, as follows. For each
directed edge $e = u \to w$ we fix a probability $p_e$ that a rumor that reached $u$ is propagated through this edge
to node $w$ (as in the IC model), and events for different rumors and different edges are independent. Following
the literature, we use $p_{u \to w} = 1/\deg^-(w)$, where $\deg^-(w)$ denotes the in-degree of $w$. If we denote with $S$
the final set of nodes that the rumor created at $v$ reached during the (simulated) diffusion process (which always
terminates), then through this process we generated an item $(t, S)$, without the need to explicitly define $\pi(S)$.
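
The two-phase process can be sketched in a few lines. This is a simplified illustration rather than the code used in the experiments: the adjacency-list representation, the function names, and the propagation probability of one over the in-degree of the target node are assumptions made here.

```python
import random


def bias(out_degree):
    """Head probability of a node's coin, by out-degree class (Table 2)."""
    if out_degree >= 1000:
        return 0.1
    if out_degree >= 500:
        return 0.05
    if out_degree >= 100:
        return 0.01
    return 0.0


def diffuse(source, out_edges, in_degree, rng):
    """One IC diffusion from `source`; edge u->w fires with probability 1/in_degree[w]."""
    reached = {source}
    frontier = [source]
    while frontier:
        u = frontier.pop()
        # Each newly reached node gets exactly one chance to use each outgoing edge.
        for w in out_edges.get(u, ()):
            if w not in reached and rng.random() < 1.0 / in_degree[w]:
                reached.add(w)
                frontier.append(w)
    return frozenset(reached)


def generate_items(t, out_edges, in_degree, rng):
    """Creation phase (biased coins), then one diffusion per created rumor."""
    items = []
    for v, neighbors in out_edges.items():
        if rng.random() < bias(len(neighbors)):
            items.append((t, diffuse(v, out_edges, in_degree, rng)))
    return items
```

Calling `generate_items(t, out_edges, in_degree, random.Random(0))` for successive values of `t` yields a stream of items $(t, S)$ without ever computing $\pi(S)$ explicitly. The diffusion always terminates because every node enters the frontier at most once.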

5.1 Efficiency and Accuracy 

In Sect. 4.1 we showed that when a run of WIGGINS-APX converges (according to a sample $\mathcal{I}$) the computed
$c$-schedule is optimal with respect to the sample $\mathcal{I}$. In our first experiment, we measure the rate
of convergence and the execution time of WIGGINS-APX. We fix $\epsilon = 0.1$, $\theta = 0.75$, and consider $c \in \{1, 3, 5\}$.
For each dataset, we use a sample $\mathcal{I}$ that satisfies the sample-size condition and run WIGGINS-APX for 30 iterations.
Denote the schedule computed at iteration $i$ by $p^i$. As shown in Figure 1, the sequence of cost values of the
schedules $p^i$, $\mathrm{cost}_\theta(p^i, \mathcal{I})$, converges within a few iterations.

For each graph, the size of the sample $\mathcal{I}$, the average size of the sets in $\mathcal{I}$, and the average time of each
iteration are given in Table 3. Note that the running time of each iteration is a function of both the sample size
and the sizes of the sets (informed-sets) inside the sample.

Next, we extract the 1-schedule output by WIGGINS-APX, and compare its cost to those of four other natural
schedules: unif, outdeg, indeg, and totdeg, which probe each node, respectively, uniformly, proportionally to
its out-degree, proportionally to its in-degree, and proportionally to the number of incident edges. Note that
for undirected graphs outdeg, indeg, and totdeg are essentially the same schedule.
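
The four baseline schedules can be derived directly from the edge list. The sketch below is illustrative; the function name and the dictionary-based representation are assumptions made here, and the graph is assumed to have at least one edge.

```python
def baseline_schedules(nodes, edges):
    """Return unif, outdeg, indeg, totdeg as dicts: node -> probing probability."""
    out_deg = {v: 0 for v in nodes}
    in_deg = {v: 0 for v in nodes}
    for u, w in edges:  # directed edge u -> w
        out_deg[u] += 1
        in_deg[w] += 1
    tot_deg = {v: out_deg[v] + in_deg[v] for v in nodes}

    def normalize(weights):
        total = sum(weights.values())
        return {v: x / total for v, x in weights.items()}

    return {
        "unif": {v: 1.0 / len(nodes) for v in nodes},
        "outdeg": normalize(out_deg),
        "indeg": normalize(in_deg),
        "totdeg": normalize(tot_deg),
    }
```

When every undirected edge is replaced by two directed ones, each node has equal in-degree and out-degree, so the three degree-based schedules coincide after normalization.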

Figure 1: The cost of intermediate $c$-schedules at iterations of WIGGINS-APX according to $\mathcal{I}$.


Datasets        sample size   avg. item size   avg. iter. time (sec)
Enron-Email           97309         12941.33                  204.59
Brightkite            63652         17491.08                  144.35
web-Notredame        393348           183.75                   10.24
web-Google           998038           704.74                  121.88

Table 3: Sample size, average size of items in the sample, and the running time of each iteration in WIGGINS-APX (for $c = 1$).


To have a fair comparison among the costs of these schedules and WIGGINS-APX, we calculate their costs
according to 10 independent samples that satisfy (12), and compute the average. The results
are shown in Table 4 and show that WIGGINS-APX outperforms the other four schedules.


Dataset         WIGGINS-APX   uniform   outdeg    indeg   totdeg
Enron-Email            7.55     14.16     9.21     9.21     9.21
Brightkite             4.85      9.64     6.14     6.14     6.14
web-Notredame         96.10     97.78    97.37    97.43    97.40
web-Google           213.15    230.88   230.48   230.47   230.47

Table 4: Comparing the costs of 5 different 1-schedules.


5.1.1 A Test on Convergence to Optimal Schedule 

Here, we further investigate the convergence of WIGGINS-APX, using an example graph and process for which
we know the unique optimal schedule. We study how close the WIGGINS-APX output is to the optimal
schedule when (i) we start from different initial schedules $p^0$, or (ii) we use samples $\mathcal{I}$ obtained during
time intervals of different lengths.

Suppose $G = (V, E)$ is the complete graph where $V = [n]$. Let $\Gamma = (\mathcal{F}, \pi)$ for
$\mathcal{F} = \{S \in 2^{[n]} \mid 1 \le |S| \le 2\}$, and $\pi(S) = 1/|\mathcal{F}|$. It is easy to see that $\mathrm{cost}_\theta(p)$ is a
symmetric function, and thus the uniform schedule is optimal. Moreover, by the uniqueness corollary, the uniform
schedule is the only optimal schedule, since $\{v\} \in \mathcal{F}$ for every $v \in V$. Furthermore, we let $\theta = 0.99$ to
increase the sample complexity (as in the sample-size lemma) and make it
harder to learn the uniform/optimal schedule.
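
To make the symmetry concrete, the family of all singletons and pairs over $[n]$ can be counted directly (a short check added here, assuming that $\pi$ gives every set in the family the same probability):

$$|\mathcal{F}| = n + \binom{n}{2} = \frac{n(n+1)}{2} .$$

Each node $v$ belongs to exactly $1 + (n - 1) = n$ sets of $\mathcal{F}$: its singleton and its $n - 1$ pairs. For $n = 4$, for example, $|\mathcal{F}| = 10$ and every node lies in 4 sets, so relabeling the nodes leaves the process unchanged.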

In our experiments we run the WIGGINS-APX algorithm, using (i) different random initial schedules, and
(ii) samples $\mathcal{I}$ obtained from time intervals of different lengths. For each sample, we run WIGGINS-APX
10 times with 10 different random initial schedules, and compute the exact cost of each schedule and its
variation distance to the uniform schedule. Our results are plotted in Figure 2. As shown, by increasing
the sample size (using longer time intervals of sampling) the output schedules get very close to the uniform
schedule (the variance gets smaller and smaller).
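
The distance used in this comparison can be computed as below, assuming that "variation distance" refers to the total variation distance, i.e., half the $L_1$ distance between the two schedules. The function name is introduced here for illustration.

```python
def variation_distance(p, q):
    """Total variation distance between two schedules given as dicts node -> probability."""
    nodes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in nodes)
```

The value is 0 when the two schedules coincide and approaches 1 when they put their mass on disjoint sets of nodes.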

5.2 Dynamic Settings 

In this section, we present experimental results that show how our algorithm can adapt itself to a new
situation. The experiment is illustrated in Fig. 3. For each graph, we start by following an optimal 1-schedule
in the graph. At the beginning of each “gray” time interval, the labels of the nodes are permuted
randomly, to impose a great disruption on the system. Following that, at the beginning of each “green” time
interval our algorithm starts gathering samples of $\Gamma$. Then, WIGGINS-APX computes the schedule for the
new sample, using 50 iterations, and starts probing. The length of each colored time interval is
$R$, which scales with $\log(n) + \log(2)$ and is computed for $\epsilon = 0.5$, as motivated by our sample-size analysis.

Figure 2: The cost of WIGGINS-APX outputs and their variation distance to the optimal schedule.

Since the cost function is defined asymptotically (and explains the asymptotic behavior of the system
in response to a schedule), in Figure 3 we plot the load of the system $L_\theta(t)$ over time (blue), and the
average load in the normal and perturbed time intervals (red). Based on this experiment, and as shown
in Figure 3, after adapting to the new schedule, the effect of the disruption caused by the perturbation
disappears immediately. Note that when the difference between the optimal cost and the cost of any other schedule is
small (as for web-Notredame), the jump in the load will be small (e.g., as shown in Figure 1 and Table 4, the
cost of the initial schedule for web-Notredame is very close to the optimal cost, obtained after 30 iterations).
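
The load curve plotted in this experiment can be simulated step by step. The sketch below is only an illustration under assumptions made here: the load at time $t$ is taken to be the sum of $\theta^{t - t_0}$ over all uncaught items $(t_0, S)$, an item counts as caught when at least one probed node lies in $S$, and the $c$ probes are drawn independently from the schedule. The function names are hypothetical.

```python
import random


def sample_probes(p, c, rng):
    """Draw c nodes independently according to the schedule p (dict node -> probability)."""
    nodes = list(p)
    return rng.choices(nodes, weights=[p[v] for v in nodes], k=c)


def step_load(uncaught, new_items, probes, theta, t):
    """Advance one time step and return (still uncaught items, load at time t).

    uncaught, new_items -- lists of items (t0, S), with S a set of nodes
    probes              -- nodes probed at time t
    """
    probed = set(probes)
    # An item is caught as soon as one probed node belongs to its set S.
    remaining = [(t0, s) for t0, s in uncaught + new_items if probed.isdisjoint(s)]
    # Assumed load: total novelty theta^(age) of the items that are still uncaught.
    load = sum(theta ** (t - t0) for t0, _ in remaining)
    return remaining, load
```

Running `step_load` in a loop, with items drawn from the generating process and probes drawn by `sample_probes`, produces a load trajectory comparable in spirit to the blue curve. Permuting the node labels of the schedule at a chosen step would mimic the perturbation.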

6 Conclusions 

We formulate and study the $(\theta, c)$-Optimal Probing Schedule Problem, which requires finding the best probing
schedule that allows an observer to find most pieces of information recently generated by a process $\Gamma$, by
probing a limited number of nodes at each time step.

We design and analyze an algorithm, WIGGINS, that can solve the problem optimally if the parameters
of the process $\Gamma$ are known, and then design a variant that computes a high-quality approximation of the
optimum schedule when only a sample of the process is available. We also show that WIGGINS can be adapted
to the MapReduce framework of computation, which allows us to scale up to networks with millions of nodes.
The results of experimental evaluation on a variety of graphs and generating processes show that WIGGINS
and its variants are very effective in practice.

Interesting directions for future work include generalizing the problem to allow for non-memoryless
schedules and different novelty functions.

Figure 3: Perturbation, sampling, and adapting (for details see Section 5.2).